Abstract—The growing rate of public space CCTV installations
has generated a need for automated methods for exploiting
video surveillance data including scene understanding, query,
behaviour annotation and summarization. For this reason, extensive
research has been performed on surveillance scene understanding 
and analysis. However, most studies have considered single scenes, 
or groups of adjacent scenes. The semantic similarity between 
different but related scenes (e.g., many different traffic scenes of 
similar layout) is not generally exploited to improve automated
surveillance tasks and reduce manual effort. Exploiting
commonality, and sharing any supervised annotations, between
different scenes is however challenging: some scenes are
totally unrelated, and thus any information sharing between
them would be detrimental, while others may only share a subset
of common activities, and thus information sharing is only
useful if it is selective. Moreover, semantically similar activities
which should be modelled together and shared across scenes 
may have quite different pixel-level appearance in each scene. To 
address these issues we develop a new framework for distributed 
multiple-scene global understanding that clusters surveillance 
scenes by their ability to explain each other’s behaviours; and 
further discovers which subset of activities are shared versus 
scene-specific within each cluster. We show how to use this 
structured representation of multiple scenes to improve common 
surveillance tasks including scene activity understanding, cross-scene
query-by-example, behaviour classification with reduced
supervised labelling requirements, and video summarization. In 
each case we demonstrate how our multi-scene model improves 
on a collection of standard single scene models and a flat model 
of all scenes. 

Index Terms—Visual Surveillance, Transfer Learning, Scene
Understanding, Video Summarization. 

I. Introduction 

The widespread use of public space CCTV camera systems
has generated unprecedented amounts of data which
can easily overwhelm human operators due to the sheer length 
of the surveillance videos and the large number of surveillance 
videos captured at different locations concurrently. This has 
motivated numerous studies into automated means to model, 
understand, and exploit this data. Some of the key tasks 
addressed by automated surveillance video understanding include:
(i) Behaviour profiling / scene understanding to reveal
what are the typical activities and behaviours in the surveilled 
space E, E, 0 , 141, 0; (ii) Behaviour query by example, 
allowing the operator to search for similar occurrences to a 
specified example behaviour (T); (iii) Supervised learning to 
classify/annotate activities or behaviours if events of interest 
are annotated in a training dataset Q; (iv) Summarization to 
give an operator a semantic overview of a long video in a
short period of time (§1 and (v) Anomaly detection to highlight 
to an operator the most unusual events in a recording period 
E, 11, 0. So far, all of these tasks have generally been 
addressed within a single scene (single video captured by a 
static camera), or a group of adjacent scenes. 

Compared with single scene recordings, the multi-camera
surveillance network (cameras distributed over different locations)
is a more realistic scenario in surveillance applications
and thus of more interest to end users. An example of a
multi-camera surveillance network is given in Fig. 1, where
surveillance videos capture mostly traffic scenes with various 
layouts and motion patterns. In such a multi-scene context, 
new surveillance tasks arise. For behaviour profiling / scene 
understanding, human operators would like to see which 
scenes within the network are semantically similar to each 
other (e.g. similar scene layout and motion patterns), which 
activities are in common - and which are unique - across 
a group of scenes, and how activities group into behaviours. 
Here activity refers to a spatio-temporally compact motion pattern
due to the action of a single or small group of objects (e.g.
vehicles making a turn) and behaviour refers to the interaction 
between multiple activities within a short temporal segment 
(e.g. horizontal traffic flow with vehicles going east and west 
and making a turn). For query-by-example, searching for a 
specified example behaviour should be carried out not only 
within scene but also across multiple scenes. For behaviour 
classification, annotating training examples in every scene 
exhaustively is not scalable. However multi-scene modelling 
potentially addresses this by allowing labels to be propagated 
from one scene to another. For summarization, generating
a summary video for multiple scenes by exploiting cross-scene
redundancy can provide the user who monitors a set of
cameras with an overview of all the distinctive behaviours that
have occurred in a set of scenes. Multi-scene summarization
can reduce the summary length and achieve higher compression
than single-scene summarization. Combined with query-by-example
(find more instances of a behaviour in a summary),
this enables flexible exploration of scenes at multiple scales.

Fig. 1: An example of a multi-camera surveillance network with
camera views distributed across different locations.

Despite the clear potential benefits of exploiting multi-scene
surveillance, it cannot be achieved with existing single-scene
models E, E, E, 0, 0. These approaches learn
an independent model for each scene and do not discover
corresponding activities or behaviours across scenes even if
they share the same semantic meaning. This makes any cross-scene
reasoning about activities or behaviours impossible. In
order to synergistically exploit multiple scenes in surveillance,
a multi-scene model with the following capabilities
is required: (i) Learning an activity representation that can
be shared across scenes; (ii) Modelling behaviours with the
shared representation so they are comparable across scenes; and
(iii) Generalising surveillance tasks to the multi-scene case,
including behaviour profiling/scene understanding, cross-scene
query-by-example, cross-scene classification and multi-scene
summarization. However this is intrinsically challenging for
three reasons:

1) Computing Scene Relatedness 

Determining the relatedness of scenes is critical for 
multi-scene modelling because naive information 
sharing between insufficiently related scenes can easily 
result in ‘negative transfer’ [7], [8]. However, the
relatedness of scenes is hard to estimate because the 
appearance of elements in a scene (e.g. buildings, 
road surface markings, etc.) is visually diverse, and 
strongly affected by camera view, making appearance- 
based similarity measurement unreliable. Similarity 
measurement based on motion is less prone to visual 
noise in surveillance applications. However most studies 
only focus on discovering similarity at the activity level
IE lf8l . Thus how to measure scene-level relatedness 
is still an open question. 

2) Selective sharing of information 

Large multi-camera surveillance networks cover
various types of scenes. Some scenes are totally 
unrelated which means they convey different semantic 
meanings to a human. However, more subtly, even 
between similar scenes, there may be some activities 
in common and other activities that are unique to each. 
Learning a large universal model in this situation is 
prone to over-fitting due to the high model complexity. 
Hence a model that discovers (un)relatedness of scenes 
and selectively shares activities between them is 
necessary. 

3) Constructing a shared representation 

Within related scenes, a shared representation needs to
be discovered in order to exploit their similarity for
cross-scene query-by-example and multi-scene summarization.
Both common and unique activities should
be preserved in this process so that not only the
commonality but also the distinctiveness between
scenes can be discovered.

To address these challenges we develop a new framework
illustrated in Fig. 2. We first learn local representations for
each scene separately. Then related scenes are discovered by
clustering. A shared semantic representation is constructed
to represent activities and behaviours within each group of
related scenes. Specifically, we first represent each scene with
a low-dimensional ‘semantic’ (rather than pixel level) representation
through learning a fast unsupervised topic model for
each scene. Using a topic-based representation allows us to reduce
the impact of pixel-noise in discovering activity and scene 
similarity. We next group semantically related scenes into a 
scene cluster by exploiting the correspondence of activities 
between different scenes. Finally, scenes within each cluster 
are projected to a shared representational space by computing 
a shared activity topic basis (STB), shared among all scenes 
but also allowing each scene to have unique topics if supported 
by the data. Behaviours in each scene are represented with the 
learned STB. 

In addition to profiling for revealing the multi-scene activity
structure across all scenes, we use this structured representation
to support cross-scene query, label-propagation
for classification and multi-scene summarization. Cross-scene
query by example is enabled because within each cluster,
the semantic representation is shared, so an example in one
scene can retrieve related examples in every other scene in
the cluster. Behaviour classification/annotation in a new scene
without annotations is supported because, once associated to
a scene cluster, it can borrow the label-space and classifier
from that cluster. Finally, we define a novel joint multi-scene
approach to summarization that exploits the shared
representation to compress redundancy both within and across
scenes of each cluster.

II. Related Work 

Surveillance Scene Understanding Scene understanding 
is a wide area that is too broad to review here. However, 
some relevant studies to this work include those based on 
object tracking 0 , he CD, CD, which model behaviours 
for example by Hidden Markov Model (HMM) 0, flOl . 
Gaussian Process m, clustering nn and stochastic context- 
free grammars lfl4l and those based on low-level feature 
statistics such as optical flow CD, E, 0, 0 that often 
model behaviours by probabilistic topic model (PTM) 0, 
El, 0. The latter category of approaches are the most related 
to ours, as we also build upon PTMs. However, all of these
studies operate within-scene rather than modelling globally 
distributed scenes and discovering shared activities. 

topics have previously been shown to robustly reveal semantic activities 
from cluttered scenes □, 0. 
Fig. 2: An illustration of the proposed framework.


Multi-Scene Understanding We make an explicit distinction
from another line of work that discovers connections and
correlations between multiple overlapping or non-overlapping 
scenes connected by a single camera network covering small 
areas fra , rm This is orthogonal to our area of interest, 
which is more similar to multi-task learning (3 - how to 
share information between multiple scenes some of which 
have semantic similarities, but do not necessarily concurrently 
surveil topologically connected zones. 

Fewer approaches have tried to exploit relatedness between 
scenes without a topological relationship El, 0. To recognize 
the same activity from another viewpoint, Khokhar et al.
0 proposed a geometric transformation based method to 
align two events, represented as Gaussian mixtures, before 
computing their similarity. Xu et al. 0 used a trajectory-based 
event description and learned motion models from trajectories 
observed in a source domain. This model was then used for 
cross-domain classification and anomaly detection. 

In the context of static image (rather than dynamic scene) 
understanding ns) , rm studies have clustered images by 
appearance similarity. However, this does not apply directly 
to surveillance scenes because the background is no longer 
stationary nor uniform, e.g. building and road appearance are 
visually salient but can vary significantly between surveillance
scenes at different locations. It is not reliable to relate
surveillance scenes based on appearance - the important cue 
is activity instead. 

Video Query and Annotation Video query has always been 
an important issue in surveillance applications. A lot of work 
has been done on semantic retrieval (20), d Hu et al. HI 
used trajectories to learn an activity model and construct 
semantic indices for video databases. Wang et al. [1] represent
video clips as topic profiles and measure similarity between
query and candidate clips as relative entropy. Retrieved clips
are sorted according to the distance to the query. However 
none of these techniques take a multi-scene scenario into 
consideration, where query examples are selected in one scene 
and candidate clips can be retrieved from other scenes at 
different locations. 

Related to video query, video behaviour annotation/classification
has been addressed in the literature
m, also in terms of video segmentation (2ll . However, 
these approaches are typically domain/scene-specific, which 
means that each scene needs extensive annotation of training 
data; where ideally labels should instead be borrowed 
from semantically related scenes. Although a recent study 
0 recognised events across scenes at the activity level,
scene-level behaviour classification and dealing with a
heterogeneous database of scenes remain open problems.

Video Summarization Video summarization has received
much attention in the literature in recent years due to the 
need to digest large quantities of video for efficient review 
by users. A review can be found in [22]. There are a variety
of approaches to summarization, varying both in how the 
summary is represented/composed, and how the task is 
formalised in terms of what type of redundancy should be 
compressed. 

Summaries have been composed by: static keyframes that 
represent the summary as a collection of selected key-frames 
[23], dynamic skimming which composes a summary based on
a collection of selected clips, and more recently synopsis. Syn¬ 
opsis [24), 0 temporally re-orders (spatially non-overlapping) 
activities from the original video into a temporally compact 
summary video by shifting activity tubes temporally so they 
occur more densely. The objective of summarization can be 
formalised in various ways: to show all foreground activity 
in the shortest time [24], to minimise the reconstruction error
between the summary and the original video, to show at least
one example of every typical behaviour, or more abstractly to
achieve the highest rating in a user study [23].

As the number of scenes grows, multi-view summarization 
becomes increasingly important to help operators monitor 
activities in numerous scenes. However, multi-view summarization
is much less studied than single-view summarization.
Lou et al. [25] adopted multi-view video coding to deal with
multi-view video compression, but did not tackle the more
challenging compression of semantic redundancy. Fu et al.
[26] addressed generating concise multi-view video summaries
by multi-objective optimisation for generating representative
summary clips. Recently, De Leo et al. [27] proposed a multi-camera
video summarization framework which summarizes at
the level of activity motif [28]. Due to the severe occlusion,
far-field of view and high density activities in surveillance 
videos, none of the existing techniques solve the problem of 
distributed multi-scene surveillance video summarization. 

In this paper, we pursue video summarization from the 
perspective of selecting the smallest set of representative video 
clips that still have good coverage of all the behaviours in the 
scene(s). Such multi-scene summarization compresses redundancy
across as well as within scenes. This corresponds to an
application scenario where the user tasked with monitoring
a set of cameras wants an overview of all the behaviours
that occurred in a set of video streams during a recording
period regardless of the source of the video recordings, which
typically come from different locations. This perspective on
summarization is attractive because it makes sense of video
content independent of location and local context. This offers a
more holistic conceptual summarization in a global context 
as compared to summarization as visualisation of a single 
scene in a local context such as video synopsis. Interestingly, 
combined with our query-by-example, we can take a behaviour 
of interest shown in the summary as query to search for similar 
behaviours in other scenes. Thus the framework presents both 
compact multi-scene summarization and a finer scene-specific 
zoom-in, capable of compressing semantically equivalent examples
no matter what scene they occur in.

Our Contributions A system based on our framework can
answer questions such as ‘show me which scenes are similar
to this?’ (scene clustering), ‘show me which activities are
in common and which are distinct between these scenes’
(multi-scene profiling), ‘show me all the distinct behaviours in
this group of scenes’ (multi-scene summarization), ‘show me
other clips from any scene that are similar to this nominated
example’ (cross-scene query), and ‘annotate this newly provided
scene with no labels’ (cross-scene classification). Specifically,
we make the following key contributions:

1) Introducing the novel and challenging problems of joint
multi-scene modelling and analysis.

2) Developing a framework to solve the proposed problem
by discovering similarity between activities and scenes,
clustering scenes based on semantic similarity and learning
a shared representation within scene clusters.

3) We show how to exploit this novel structured multi-scene
model for practical yet challenging tasks of cross-scene
query-by-example and behaviour annotation.

4) We further exploit this model to achieve multi-scene
video summarization, achieving compression beyond
standard single-scene approaches.

5) We introduce a large multi-scene surveillance dataset
containing 27 distinct views from distributed locations
to encourage further investigation into realistic multi-scene
visual surveillance applications.

III. Learning Local Scene Activities 

Given a set of surveillance scenes we first learn local activi¬ 
ties in each individual scene using Latent Dirichlet Allocation 
(LDA) [29]. Although there are more sophisticated single-scene
models AD, ED, £Q, we use LDA because it is the
simplest, most robust, most generally applicable to a wide 
variety of scene types, and the fastest for learning on large 
scale multi-scene data. However, it could easily be replaced by 
more elaborate topic models (e.g. HDP [1]). LDA generates a
set of topics to explain each scene. Topics are usually spatially 
and temporally constrained sub-volumes reflecting the activity 
of a single or small group of objects. Following (l), O, we use 
activities to refer to topics and behaviours to refer to scene- 
level state defined by the coordinated activities of all scene 
participants. 

A. Video Clip Representation 

We follow the general approach of [1] to construct visual
features for topic models. For each video out of an $M$ scene
dataset we first divide the video frame into $N_a \times N_b$ cells
with each cell covering $H \times H$ pixels. Within each cell we
compute optical flow [30], taking the mean flow as the motion
vector in that cell. Then we quantize each motion vector into $N_m$
fixed directions. Note, stationary foreground objects can be 
readily added as another cell state as described in A3, ED- 
Therefore a codebook $V$ of size $N_v = N_a \times N_b \times N_m$ is
generated by mapping motion vectors to discrete visual words
(from 1 to $N_v$). $N_d$ visual documents $X = \{x_j\}_{j=1}^{N_d}$ are
then constructed by segmenting the video into non-overlapping
clips of fixed length, where each clip $x_j = \{x_{ij}\}_{i=1}^{N_j}$ has $N_j$
visual words $x_{ij}$. Clip and document are used interchangeably
here with both indicating visual words accumulated in a
temporal segment.
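
The word construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the exact pipeline: the function names, the 0-based cell indices, the choice of $N_a$ as the horizontal cell count and the direction-major layout of word ids are all hypothetical, and the optical flow itself is taken as given.

```python
import math

def flow_to_word(cell_x, cell_y, dx, dy, n_a, n_b, n_m):
    """Map the mean flow (dx, dy) of one cell to a visual word id in 1..N_v.

    Assumptions: cell_x in 0..n_a-1, cell_y in 0..n_b-1, and words are laid
    out direction-major (all cells for direction 0, then direction 1, ...).
    """
    angle = math.atan2(dy, dx) % (2 * math.pi)
    step = 2 * math.pi / n_m
    d = int(round(angle / step)) % n_m  # nearest of the n_m fixed directions
    return d * n_a * n_b + cell_y * n_a + cell_x + 1

def clip_documents(timed_words, clip_len):
    """Group (time, word) pairs into non-overlapping clips of fixed length."""
    docs = {}
    for t, w in timed_words:
        docs.setdefault(int(t // clip_len), []).append(w)
    return [docs[j] for j in sorted(docs)]

# Example: a 4 x 3 grid with 4 directions gives N_v = 48 words.
first = flow_to_word(0, 0, 1.0, 0.0, 4, 3, 4)
last = flow_to_word(3, 2, 0.0, -1.0, 4, 3, 4)
```

Each returned clip is a bag of word ids, which is the document format a topic model expects.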

B. Learning Local Activities with Topic Model 

Learning LDA for scene $s$ discovers the dynamic ‘appearance’
of $k = 1 \dots K$ typical topics/activities² (multinomial
parameter $\beta_k^s$), and explains each visual word $x_{ij}^s$ in each clip
$x_j^s$ by a latent topic $y_{ij}^s$ specifying which activity generated
it, as shown in Fig. 3. The topic selection $y_{ij}^s$ is drawn from a
multinomial mixture of topics parametrized by $\theta_j^s$, which is
further governed by a Dirichlet distribution with parameter
$\alpha^s$. In scene $s$ the joint probability of $N_d$ visual documents
$X^s = \{x_j^s\}_{j=1}^{N_d}$, topic selections $Y^s = \{y_j^s\}_{j=1}^{N_d}$ and topic
mixtures $\theta^s = \{\theta_j^s\}_{j=1}^{N_d}$ given hyperparameters $\alpha^s$ and $\beta^s$ is:

$$p(\theta^s, Y^s, X^s \mid \alpha^s, \beta^s) = \prod_{j=1}^{N_d} p(\theta_j^s \mid \alpha^s) \prod_{i=1}^{N_j} p(y_{ij}^s \mid \theta_j^s)\, p(x_{ij}^s \mid y_{ij}^s, \beta^s) \qquad (1)$$



Fig. 3: Graphical model for Latent Dirichlet Allocation. 

1) Model Inference: Exact inference in LDA is intractable
due to the coupling between $\theta$ and $\beta$ [29]. Variational inference
approximates a lower bound of the log likelihood by introducing
variational parameters $\gamma$ and $\phi$. The Dirichlet parameter
$\gamma_j$ is a clip-level topic profile and specifies the mixture ratio
of each activity $\beta_k$ in a clip $x_j$. Thus, each video clip is
represented as a mixture of activities ($\gamma_j$). The variational EM
procedure for LDA is given in Algorithm 1, where $\mathbb{1}(\cdot)$ is an
indicator function and $\Psi(\cdot)$ is the first derivative of the $\log \Gamma$
function. For efficiency, we apply the sparse updates identified
in [32] for an order of magnitude speed increase.

2 In text analysis, a topic refers to a group of co-occurring words in a
document. Activity refers to a motion pattern, which defines the group of
co-occurring visual words in a video clip. They are used interchangeably in
the following text.


Algorithm 1 Topic model learning for a single scene

initialize $\alpha_k = 1$
initialize $\beta = \mathrm{random}(N_v, K)$
initialize $\phi_{ijk} = 1/K$
repeat
    E-Step:
    for $j = 1 \to N_d$ do
        for $k = 1 \to K$ do
            $\gamma_{jk} = \alpha_k + \sum_{i=1}^{N_j} \phi_{ijk}$
            for $i = 1 \to N_j$ do
                $\phi_{ijk} = \beta_{x_{ij} k} \exp(\Psi(\gamma_{jk}))$
            end for
        end for
    end for
    M-Step:
    for $v = 1 \to N_v$ do
        for $k = 1 \to K$ do
            $\beta_{vk} = \sum_{j=1}^{N_d} \sum_{i=1}^{N_j} \phi_{ijk}\, \mathbb{1}(x_{ij} = v)$
        end for
    end for
until Converge
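
For readers who prefer code, the following is a minimal pure-Python sketch of the variational EM loop in Algorithm 1. It is illustrative only and makes several assumptions that are not stated in the algorithm box: word ids are 0-based, $\phi$ is normalised over topics for every word, each topic (column of $\beta$) is normalised to sum to one after the M-step, a fixed number of iterations replaces the convergence test, and the sparse updates are omitted. The helper names (`digamma`, `word_phi`, `fit_lda`) are hypothetical.

```python
import math
import random

def digamma(x):
    """Psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series

def word_phi(beta_v, gamma_j):
    """phi over topics for one word: beta_vk * exp(Psi(gamma_jk)), normalised over k."""
    phi = [b * math.exp(digamma(g)) for b, g in zip(beta_v, gamma_j)]
    z = sum(phi)
    return [p / z for p in phi]

def normalise_topics(beta, n_topics):
    """Scale every column of beta (one topic) to sum to one over the words."""
    totals = [sum(row[k] for row in beta) for k in range(n_topics)]
    return [[row[k] / totals[k] for k in range(n_topics)] for row in beta]

def fit_lda(docs, n_words, n_topics, alpha=1.0, n_iter=30, seed=0):
    """docs: list of clips, each a list of 0-based word ids. Returns (beta, gamma)."""
    rng = random.Random(seed)
    beta = [[rng.random() + 1e-3 for _ in range(n_topics)] for _ in range(n_words)]
    beta = normalise_topics(beta, n_topics)
    gamma = [[alpha + len(doc) / n_topics] * n_topics for doc in docs]
    for _ in range(n_iter):
        counts = [[1e-9] * n_topics for _ in range(n_words)]
        for j, doc in enumerate(docs):
            # E-step: update phi for every word, then gamma_jk = alpha + sum_i phi_ijk.
            phis = [word_phi(beta[v], gamma[j]) for v in doc]
            gamma[j] = [alpha + sum(p[k] for p in phis) for k in range(n_topics)]
            for v, p in zip(doc, phis):
                for k in range(n_topics):
                    counts[v][k] += p[k]
        # M-step: beta_vk collects phi_ijk over all occurrences of word v.
        beta = normalise_topics(counts, n_topics)
    return beta, gamma

beta, gamma = fit_lda([[0, 0, 1], [2, 3, 3, 2]], n_words=4, n_topics=2)
```

Passing a fixed `beta` and skipping the M-step would give the topic profile of a clip under an existing set of topics.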


After learning all $s = 1 \dots M$ scenes, every clip $x_j^s$ is
now represented as a topic profile $\gamma_j^s$, and each scene is now
represented by its constituent activities $\beta_k^s$ (Fig. 4).


IV. Multi-Layer Activity and Scene Clustering 

We next address how to discover related scenes and learn
shared topics/activities across scenes. This multi-layer process
is illustrated in Fig. 5 for two typical clusters 3 & 7: At
the scene level we group related scenes according to activity
correspondence (Section IV-A); within each scene cluster we
further compute a shared activity topic basis so that all
activities within that cluster are expressed in terms of the same
set of topics (Section IV-B).


A. Scene Level Clustering 

In order to group related scenes, we first need to define a 
relatedness metric. Related scenes should have more common 
activities so that the model learned from them is compact. 
So we assume the scenes with semantically similar activities 
are more likely to be mutually related. We thus define the 
relatedness between two (aligned) scenes a and b , by the 
correspondence of their semantic activities. 

a) Alignment: Comparing scenes directly suffers from
cross-scene variance due to view angle. To reduce this cross-scene
variance we first align two scenes with a geometrical
transformation including scaling $t_s$ and translation $[t_x, t_y]$.
Although this is not a strong transform it is valid in the typical
case that a camera is installed upright, and with surveillance
cameras there are classic views which can be simply aligned
by scaling and translation. To achieve this, we first denote the
transform matrices for normalizing visual words in scenes
$a$ and $b$ to the origin as $T^a_{norm}$ and $T^b_{norm}$, defined as in Eq. (2).
Scaling ($t_s^a$) and translation ($t_x^a, t_y^a$) parameters are estimated
by Eq. (3).

$$T^a_{norm} = \begin{bmatrix} t_s^a & 0 & t_x^a \\ 0 & t_s^a & t_y^a \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$

$$center = \frac{\sum_{j=1}^{N_d} \sum_{i=1}^{N_j} x_{ij}^a}{N_d \cdot N_j}, \qquad t_s^a = \frac{N_d \cdot N_j}{\sum_{j=1}^{N_d} \sum_{i=1}^{N_j} \| x_{ij}^a - center \|}, \qquad \begin{bmatrix} t_x^a \\ t_y^a \end{bmatrix} = -t_s^a \cdot center \qquad (3)$$

Two scenes can thus be aligned by transforming data from $a$
to $b$ via $T^{a2b} = (T^b_{norm})^{-1} \cdot T^a_{norm}$. We then denote the $k$th topic
in scene $a$ as $\beta_k^a$. So any topic $k$ in $a$ can be aligned for
comparison with those in $b$ by $T^{a2b}$.
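
The alignment step can be illustrated with a short Python sketch. It is a sketch under assumptions: visual-word positions are given as 2D points, the scale is taken as the reciprocal of the mean distance to the centre (so that a normalised scene has unit mean radius), and the scale-plus-translation transform is stored as a tuple `(t_s, t_x, t_y)` instead of a 3 x 3 matrix. All function names are hypothetical.

```python
import math

def norm_transform(points):
    """Return (t_s, t_x, t_y) moving the centroid to the origin with unit mean radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    t_s = 1.0 / mean_dist
    return t_s, -t_s * cx, -t_s * cy

def compose_a2b(norm_a, norm_b):
    """inverse(T_norm_b) composed with T_norm_a, again a scale plus translation."""
    s_a, x_a, y_a = norm_a
    s_b, x_b, y_b = norm_b
    return s_a / s_b, (x_a - x_b) / s_b, (y_a - y_b) / s_b

def apply_transform(t, point):
    """Apply p -> t_s * p + [t_x, t_y]."""
    t_s, t_x, t_y = t
    return t_s * point[0] + t_x, t_s * point[1] + t_y

# Scene b is scene a scaled by 2 and shifted by (10, 5).
scene_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
scene_b = [(10, 5), (14, 5), (10, 9), (14, 9)]
a2b = compose_a2b(norm_transform(scene_a), norm_transform(scene_b))
corner = apply_transform(a2b, (2, 2))  # lands on the matching corner of scene b
```

Any constant factor in the scale cancels in the composition, so the aligned coordinates do not depend on the particular normalisation radius chosen.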

We denote the topic transformation procedure as $\beta' = H(\beta; T)$.
This transformation is applied to topics in a similar
way to an image transform. That is, given that $\beta$ is an
$N_a \times N_b \times N_m$ matrix and a transform matrix $T$ is defined as in
Eq. (2), we first estimate the size $N'_a \times N'_b \times N_m$ of the transformed
topic $\beta'$ by $N'_a = N_a \times t_s$ and $N'_b = N_b \times t_s$. To obtain the
value for each element/pixel of $\beta'(x', y', d')$, we trace back
to the position $[x, y, d]$ in the original topic $\beta$. If we only
consider scaling and translation, direction $d$ is unchanged
throughout the procedure, i.e. $d' = d$. Therefore, $x$ and $y$ are
determined by:

$$[x \;\; y \;\; 1] = [x' \;\; y' \;\; 1] \cdot (T^{-1})^{\top} \qquad (4)$$


In most cases, $x$ and $y$ are not integers because
of the matrix multiplication. In order to obtain the value for
$\beta'(x', y', d')$, we perform interpolation, i.e. we use the values
of adjacent pixels surrounding $[x, y, d]$ to determine the value
of $\beta'(x', y', d')$. This interpolation only involves spatial
values in a single layer, i.e. $d$ is fixed, and we only use the
adjacent pixels by varying $x$ and $y$. A number of standard
interpolation techniques can be used for this task including
linear, bilinear and bicubic interpolation; we use bicubic
interpolation here. After interpolation, we have the exact
value for each element/pixel $\beta'(x', y', d')$. Because this
transformation involves translation, the transformed topic $\beta'$
may extend out of the topic boundary, an $N_a \times N_b$ rectangle
defined by the original topic $\beta$. To ensure that all topics are
comparable with the same codebook size, we only keep the
part of $\beta'$ that lies within the $N_a \times N_b$ rectangle defined by the
original topic $\beta$. After the above procedure, the transformed
topic $\beta'$ has the same size as the original $\beta$, $N_a \times N_b \times N_m$.
Finally, we normalise the transformed topic $\beta'$ to obtain a
multinomial distribution, as follows:

$$1 = \sum_{x=1}^{N_a} \sum_{y=1}^{N_b} \sum_{d=1}^{N_m} \beta'(x, y, d) \qquad (5)$$
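
The inverse-mapping procedure can be sketched as follows. Note the deliberate simplification: the sketch uses bilinear interpolation in place of the bicubic interpolation used in the paper, purely to keep the example short. The topic is stored as nested lists `beta[d][y][x]`, the transform is a `(t_s, t_x, t_y)` tuple, and the function names are hypothetical.

```python
def bilinear(layer, x, y):
    """Sample one direction layer at a real-valued position; zero outside the grid."""
    h, w = len(layer), len(layer[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = layer[y0][x0] * (1 - fx) + layer[y0][x1] * fx
    bottom = layer[y1][x0] * (1 - fx) + layer[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def transform_topic(beta, t):
    """Warp a topic by scale and translation, crop to the original grid, renormalise."""
    t_s, t_x, t_y = t
    h, w = len(beta[0]), len(beta[0][0])
    # For every output cell (xp, yp), trace back to the source position.
    out = [[[bilinear(layer, (xp - t_x) / t_s, (yp - t_y) / t_s)
             for xp in range(w)] for yp in range(h)] for layer in beta]
    total = sum(v for layer in out for row in layer for v in row)
    if total == 0:
        raise ValueError("transformed topic has no mass inside the grid")
    return [[[v / total for v in row] for row in layer] for layer in out]

# One direction layer on a 3 x 2 grid, all mass in the top-left cell.
topic = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
shifted = transform_topic(topic, (1.0, 1.0, 0.0))  # translate one cell to the right
```

Direction layers are warped independently, reflecting that the direction index is unchanged by scaling and translation.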
Fig. 5: An illustration of multi-layer clustering of scenes and activities. Block I (top) illustrates the original surveillance video
scenes. Block II (middle) illustrates (i) related scenes grouped into clusters (indicated by green dashed boxes) and (ii) the
local topics/activities learned in each scene. Block III (bottom) illustrates (i) local topics further grouped into activity
clusters (color lines indicate some examples) and (ii) activity clusters merged to construct a shared activity topic basis
(STB).


b) Affinity and clustering: Given the scene alignment
above, we define the relatedness between scenes $a$ and $b$ by
the percentage of corresponding topic pairs. More specifically,
given $K_a$ local topics $\{\beta^a_{k_a}\}_{k_a=1}^{K_a}$ in scene $a$ and $K_b$ local
topics $\{\beta^b_{k_b}\}_{k_b=1}^{K_b}$ in scene $b$, the distance between topic $\beta^a_{k_a}$
and topic $\beta^b_{k_b}$ is defined as $\mathcal{D}_{KL}$ in Eq. (6).

$$\mathcal{D}_{KL}(\beta^a_{k_a}, \beta^b_{k_b}) = \frac{1}{2}\left( KL(\beta^a_{k_a} \,\|\, \beta^b_{k_b}) + KL(\beta^b_{k_b} \,\|\, \beta^a_{k_a}) \right), \qquad KL(\beta^a_{k_a} \,\|\, \beta^b_{k_b}) = \sum_{v=1}^{N_v} \beta^a_{k_a v} \log \frac{\beta^a_{k_a v}}{\beta^b_{k_b v}} \qquad (6)$$

Given a threshold $\tau$ the similarity between two topics can be
binarized. Topic pairs with distance less than the threshold are
counted as inliers, defined by:

$$NumInlier = \sum_{k_a} \mathbb{1}\left( \min_{k_b} \mathcal{D}_{KL}(\beta^a_{k_a}, \beta^b_{k_b}) < \tau \right) + \sum_{k_b} \mathbb{1}\left( \min_{k_a} \mathcal{D}_{KL}(\beta^a_{k_a}, \beta^b_{k_b}) < \tau \right) \qquad (7)$$

where $\mathbb{1}(\cdot)$ is the indicator function. The final relatedness
measure $\mathcal{D}(a, b)$ between scenes $a$ and $b$ is the percentage
of inlier topic pairs:

$$\mathcal{D}(a, b) = \frac{NumInlier}{K_a + K_b} \qquad (8)$$

Since Eqs. (6) and (7) are symmetric, Eq. (8) is as well. Given
this relatedness measure, every scene pair is compared to
generate an affinity matrix, and self-tuning spectral clustering
[33] is used to group scenes into $c = 1 \dots C$ semantically
similar scene-level clusters (see Fig. 5(II) for an example).
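
As an illustration of the relatedness measure, the following Python sketch computes the symmetric KL distance between topics, counts inliers in both directions and returns the inlier ratio. It assumes that topics are already aligned and flattened into equal-length probability lists; the small `eps` smoothing that guards against zero entries and the function names are additions made for this sketch.

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def relatedness(topics_a, topics_b, tau):
    """Fraction of topics in either scene whose nearest topic in the other is within tau."""
    inliers = sum(1 for p in topics_a
                  if min(sym_kl(p, q) for q in topics_b) < tau)
    inliers += sum(1 for q in topics_b
                   if min(sym_kl(p, q) for p in topics_a) < tau)
    return inliers / (len(topics_a) + len(topics_b))

scene_a = [[0.9, 0.1], [0.1, 0.9]]
scene_b = [[0.9, 0.1], [0.5, 0.5]]
score = relatedness(scene_a, scene_b, tau=0.05)  # one matching pair out of two topics each
```

Evaluating `relatedness` for every scene pair yields the affinity matrix that is passed to spectral clustering.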


B. Learning A Shared Activity Topic Basis 


Scenes clustered according to Section IV-A are semantically
similar, however the representation in each is still distinct. We
next show how to establish a shared representation for every
scene in a particular cluster. We denote the set of scenes in
a cluster as $C$. We first choose the scene with the lowest
distance to all other scenes in the cluster as the reference
scene/coordinate $s_{ref}$. Activities in all scenes $s \in C$ can be
projected to the reference coordinates via the transform $T^{s2s_{ref}}$
as stated in Eq. (9).

$$\forall s \in C,\ \forall k = 1 \dots K: \quad {\beta'}_k^{s} = H(\beta_k^s;\, T^{s2s_{ref}}) \qquad (9)$$


Once every topic is in the same coordinate system, we create
an affinity matrix for all the transformed topics $\{{\beta'}_k^{s}\}_{s \in C}$ using
the symmetrical Kullback-Leibler divergence as the distance
metric (Eq. (6)). Hierarchical clustering is then applied to
group the projected activities into $K^{stb}$ clusters $\{\mathcal{T}_k\}_{k=1}^{K^{stb}}$
($\mathcal{T}_k$ denotes the set of activities in cluster $k$). The result
is that semantically corresponding activities across scenes are
now grouped into the same cluster. We then take the mean
of the activities in each activity cluster $\mathcal{T}_k$ as one shared activity
topic $\beta_k^{stb}$, as in Eq. (10). An alternative to this approach is
to re-learn topics from the concatenation of visual words of
all the scenes in a single cluster. However, this ‘Learning-from-Scratch’
strategy prevents explicitly identifying shared
and unique topics across scenes, because the trace of local
topics from individual scenes to the STB is lost. In contrast, our
framework reveals how scenes are similar or different.


$$\forall k = 1 \dots K^{stb}: \quad \beta_k^{stb} = \frac{1}{|\mathcal{T}_k|} \sum_{k', s' \in \mathcal{T}_k} {\beta'}_{k'}^{s'} \qquad (10)$$


We denote the set of shared activity topics $\{\beta_k^{stb}\}_{k=1}^{K^{stb}}$
learned for the cluster as the shared activity topic basis (STB).
The resulting STB captures both common and unique activities
in every scene member. See Fig. 5(III) for an example. We can
now represent the behaviours in every scene as STB profiles
by projecting the STB back to each scene and re-computing
the topic profile $\gamma_j^{stb}$, defined now on $\{\beta_k^{stb}\}_{k=1}^{K^{stb}}$, in contrast
to the original scene-specific representation ($\gamma_j^s$, defined in
terms of $\{\beta_k^s\}_{k=1}^{K}$). That is, we re-run Algorithm 1 but with
$\beta$ fixed to the STB values obtained from Eq. (10). An example
of behaviour profiling on STB is illustrated in Fig. 6. Visual
words accumulated within a clip are profiled according to
the STB. Thus each behaviour can be treated as a weighted
mixture of multiple activities.
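
The averaging step that produces the shared basis can be sketched as below. The sketch assumes that the aligned topics have already been assigned to activity clusters (for example by hierarchical clustering on the symmetric KL distance, which is not shown), and it additionally records whether a cluster draws topics from more than one scene as a simple way of separating shared from scene-specific activities. That flag and the function name are illustrative additions.

```python
def shared_topic_basis(topics, scene_ids, cluster_ids):
    """Average the aligned topics in each activity cluster.

    topics:      list of flattened, aligned topic distributions
    scene_ids:   scene that each topic was learned in
    cluster_ids: activity cluster assigned to each topic
    Returns {cluster_id: (mean_topic, is_shared_across_scenes)}.
    """
    groups = {}
    for topic, scene, cluster in zip(topics, scene_ids, cluster_ids):
        groups.setdefault(cluster, []).append((topic, scene))
    basis = {}
    for cluster, members in groups.items():
        member_topics = [t for t, _ in members]
        mean = [sum(col) / len(member_topics) for col in zip(*member_topics)]
        scenes = {s for _, s in members}
        basis[cluster] = (mean, len(scenes) > 1)
    return basis

basis = shared_topic_basis(
    topics=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    scene_ids=["a", "b", "a"],
    cluster_ids=[0, 0, 1],
)
```

Because each input topic keeps its scene label, the mapping from local topics to basis topics stays traceable, which is the property that re-learning from concatenated data would lose.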


V. Cross-Scene Query by Example and 
Classification 

Given the structured multi-scene model introduced in the
previous section, we can now describe how cross-scene query
and classification can be achieved.


Cross-scene query: Activity-based query by example aims
at retrieving clips that are semantically similar to a given query clip.
In the cross-scene context, the pool of potential clips to be
searched for retrieval includes clips from every camera in the
network. Within a scene cluster $c$, we segment each video $s$
into $j = 1 \ldots N_d$ short clips (Section III-A). We represent
the $j$th video clip in scene $s$ as a topic profile $\gamma^{stb}_{j,s}$ defined
on the STB $\beta^{stb}$. A query clip $q$, represented by its STB profile
$\gamma^{stb}_q$, can now be directly compared against all other clips
in the cluster $\{\gamma^{stb}_{j,s}\}_{j,s \in c}$ using L2 distance. In this way,
cross-scene query-by-example is achieved by sorting all clips
in the cluster according to their distance to the query.
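The ranking step can be sketched as follows, assuming every clip in the scene cluster already has an STB profile. The function name `query_by_example` and the `(scene, clip)` identifiers are illustrative assumptions only.

```python
import math


def query_by_example(query_profile, pool, top=None):
    """Rank pool clips by L2 distance between their STB profile and the query.

    pool: list of (clip_id, profile) pairs, where each profile is defined on
          the same shared topic basis as the query profile.
    """
    ranked = sorted(pool, key=lambda item: math.dist(query_profile, item[1]))
    ids = [clip_id for clip_id, _ in ranked]
    return ids if top is None else ids[:top]


# Toy pool: clips from two other scenes in the same cluster.
pool = [
    (("scene2", 0), [0.80, 0.10, 0.10]),
    (("scene2", 1), [0.10, 0.80, 0.10]),
    (("scene3", 0), [0.60, 0.30, 0.10]),
]
ranking = query_by_example([0.75, 0.15, 0.10], pool)
```

Because all profiles live on one shared basis, clips from different scenes are directly comparable and no per-scene mapping is needed at query time.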


Cross-scene classification: Given an existing annotated
database of scenes modelled with our multi-layer framework,
classification in a new scene s* can now be achieved without
further annotation. First s* is associated to a cluster c*
(Section IV-A). Although s* has no annotation, this reveals a
set of semantically corresponding existing scenes from which
annotation can meaningfully be borrowed. Classification can
thus be achieved by any classifier, using all other scenes/clips
and labels from cluster c* as the labelled training set.

It should be noted that our cross-scene classification differs
from [34], [35] in two respects. (1) We train on a set of source scenes
before testing on a held-out scene, rather than transferring from one source to one
test scene. The conventional 1-1 approach implicitly requires
the source and target scenes to be relevant to each other, which must be
manually identified. Our model is able to group relevant scenes
automatically without requiring the user to know this a
priori. (2) Our model works in a transductive manner. That
is, it looks at target scene data during scene clustering, but
without looking at the target data labels. This weak assumption
is more desirable in practice because surveillance video data
is often easy to collect but comes without any labelling, whilst the
effort required for labelling is the bottleneck.


VI. Multi-Scene Summarization 

In this section we present a multi-scene video 
summarization algorithm that exploits the structure learned in 
Section IV to compress cross-scene redundancy. All clips are
represented by their profile on STB. The general objective of 
multi-scene summarization is to generate a video skim with at 
least one example of each distinct behaviour in the shortest 
possible summary. We generate independent summaries 
for each scene cluster (since different scene clusters are 
semantically dissimilar), and multi-scene summaries within 
each cluster (since scenes within a cluster are semantically 
similar). 

Fig. 6: An illustration of behaviour profiling on STB. In the left block, visual words are profiled by STB and plotted as coloured
dots. Notice that colors here indicate visual words belonging to individual activities in STB instead of motion direction. The profile
$\gamma$ is also given as a bar chart where the x axis indexes STB activities. The right block illustrates the STB activities, where color
patches indicate the distribution of motion vectors.

K-center summaries: The multi-scene summary video is of
configurable length $N_{sum}$. Longer videos will show more
distinct behaviours or more within-class variability of each
behaviour. We compose the summary $E$ of $N_{sum}$ clips
$\{\gamma^{stb}_{j,s}\}_{j,s \in E}$ drawn from all scenes in the cluster. The objective
is that all clips in the cluster $\{\gamma^{stb}_{j,s}\}_{j,s \in c}$ should be
near to at least one clip in the summary (i.e., the summary is
representative). Formally, this objective is to find the summary
set $E$ that minimizes the cost $J$ in Eq. (11), where $\mathcal{D}$ is the
L2 distance:

$$J = \max_{j,s \in c} \left( \min_{j',s' \in E} \mathcal{D}\left(\gamma^{stb}_{j,s}, \gamma^{stb}_{j',s'}\right) \right) \qquad (11)$$


This is essentially a k-center problem [36]. Since it is
intractable to enumerate all combinations of potential summaries
$E$, we adopt a 2-approximation algorithm for this
optimization. The resulting $K = N_{sum}$ centers identify the
summary clips.
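As an illustration, one standard 2-approximation for the k-center problem is the greedy farthest-first traversal. The text does not state which variant was used, so the sketch below is an assumption, and the name `k_center_summary` is hypothetical. It returns the indices of the chosen summary clips together with the resulting cost (the largest distance from any clip to its nearest summary clip).

```python
import math


def k_center_summary(profiles, n_sum, first=0):
    """Greedy farthest-first traversal, a standard 2-approximation for k-center.

    profiles: list of STB profiles, one per clip in the scene cluster.
    n_sum:    number of summary clips to select.
    first:    index of the clip used as the initial center (arbitrary choice).
    """
    if not 1 <= n_sum <= len(profiles):
        raise ValueError("n_sum must be between 1 and the number of clips")
    centers = [first]
    # Distance from every clip to its nearest chosen center so far.
    nearest = [math.dist(p, profiles[first]) for p in profiles]
    while len(centers) < n_sum:
        # The clip that is worst represented becomes the next center.
        nxt = max(range(len(profiles)), key=nearest.__getitem__)
        centers.append(nxt)
        nearest = [min(d, math.dist(p, profiles[nxt]))
                   for d, p in zip(nearest, profiles)]
    return centers, max(nearest)


# Toy profiles forming three loose groups.
profiles = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
centers, cost = k_center_summary(profiles, n_sum=3)
```

Each iteration adds the clip that is currently farthest from the summary, so rare behaviours that are far from everything else tend to be picked early.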


VII. Experiments 

Dataset: We collected 25 real traffic surveillance videos
from publicly accessible online web-cameras in Budapest,
Hungary. These videos are combined with two surveillance
video datasets, Junction and Roundabout, for a total of
27 videos. Sample frames for each scene are illustrated in
Fig. 7(a). We trim each video to 18 000 frames at 10 fps, of
which 9 000 are used to learn the model and the remaining
9 000 frames are used for testing (query, classification and
summarization). For activity learning we segment each training
video into 25-frame clips, so 360 clips are generated for
each scene. For both query and summarization applications,
we segment test videos into 80-frame clips, so 112
clips for query and summarization are generated from each
scene. Thus, we have three types of video clips: (1) clips
for unsupervised training of LDA, (2) clips for training cross-scene
classification, retrieval and multi-scene summarization
(Semantic Training Clips), and (3) clips for testing the same
tasks (Semantic Testing Clips). LDA clips are shorter (25
frames) to facilitate learning more cleanly segmented activities.
Semantic clips are longer (80 frames) as a more human-scale,
user-friendly unit for visualisation and annotation.
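The clip counts quoted above follow from simple integer arithmetic, shown here as a small check. The sketch assumes that frames left over after the last full clip are discarded, which is consistent with the reported 112 clips but is not stated explicitly.

```python
FPS = 10                 # frame rate of the trimmed videos
TRAIN_FRAMES = 9_000     # frames per scene used to learn the model
TEST_FRAMES = 9_000      # frames per scene used for testing

LDA_CLIP_FRAMES = 25     # short clips for unsupervised LDA training
SEMANTIC_CLIP_FRAMES = 80  # longer clips for query and summarization

lda_clips = TRAIN_FRAMES // LDA_CLIP_FRAMES            # clips per scene for LDA
semantic_clips = TEST_FRAMES // SEMANTIC_CLIP_FRAMES   # full clips per scene
leftover_frames = TEST_FRAMES % SEMANTIC_CLIP_FRAMES   # frames not filling a clip
semantic_clip_seconds = SEMANTIC_CLIP_FRAMES / FPS     # duration of one clip
```

With these values each semantic clip lasts 8 seconds, which is the unit behind the summary durations reported in the experiments.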


Learning Activities: We computed optical flow [30] for
all videos, quantizing each scene into 5 × 5 pixel cells
and 8 directions. Local activities are learned from each video
independently using LDA with $K = 15$ activities per scene.

Behaviour Annotation: A behaviour is a clip-level semantic
tag defining the overall scene activity. Due to the semantic
gap between behaviours in the video clip and (potentially task-dependent)
human interpretation, it is difficult to give a video clip
a concise and consistent semantic label (in contrast to human
action and event [5] recognition). Instead of annotating
each video clip with a single explicit label, we give a set of binary activity tags
(each representing the action of some objects within the scene)
to each video clip, as shown in Table I. All the tags associated
with vehicles have a sparse or dense option. When there are
fewer than three vehicles travelling in a clip, it is labelled as
sparse, otherwise dense. Each unique combination of activities
that exists in the labelled clips then defines a unique scene-level
behaviour category. We explore this through multiple sets
of annotations: an original annotation with 19 distinct tags,
and subsequent coarser label sets derived by merge scheme
1 with 13 distinct tags and merge scheme 2 with 10 distinct
tags. The activity tags are given in Table I. We exhaustively
annotate video clips in two example scene clusters (3 and 7, as
shown in Fig. 7). Across the two clusters, there are 6 scenes
with 112 clips per scene annotated (672 clips in total). In
the original annotation case, 111 distinct behaviours are
identified. The distribution of behaviours is illustrated in
Fig. 8(a). However, this number is more than necessary given
the limited distinctiveness of many of the entailed behaviours.
By merging some activity annotations we generate 59 or 31
(Merge Scheme 1 or 2 in Table I) unique behaviours. It should
be noted that the frequency of behaviours is rather imbalanced,
as indicated by all the subfigures of Fig. 8. There is also
very limited overlap of behaviours between scene clusters 3
and 7. To assess annotation consistency and bias, we invited
eight independent annotators to annotate all the video clips
separately. We observe that the additional annotations are fairly
consistent with the original annotation, with more than 80%
agreement (Hamming distance) between the additional and
the original annotations. Detailed analysis of these additional
annotations is given in the supplementary material.

Fig. 7: Example frames for our multi-surveillance video dataset with each scene assigned a reference number on top of the
frame. The color of the bounding box and the text in the bottom left indicate the assigned cluster.

Fig. 8: Frequencies of behaviours of each category. (a), (b)
and (c) illustrate the frequency of behaviours when varying
the labelling criteria: (a) original annotation, (b) merge scheme 1,
(c) merge scheme 2. Legend: Scene Cluster 3, Scene Cluster 7.

TABLE I: Original annotation ontology and two merging
schemes give multiple granularities of annotation.

No. | Original Annotation | Merge Scheme 1 | Merge Scheme 2
1 | Vehicle Left Sparse | Vehicle Left | Vehicle Horizontal
2 | Vehicle Left Dense | Vehicle Left | Vehicle Horizontal
3 | Vehicle Right Sparse | Vehicle Right | Vehicle Horizontal
4 | Vehicle Right Dense | Vehicle Right | Vehicle Horizontal
5 | Vehicle Up Sparse | Vehicle Up | Vehicle Vertical
6 | Vehicle Up Dense | Vehicle Up | Vehicle Vertical
7 | Vehicle Down Sparse | Vehicle Down | Vehicle Vertical
8 | Vehicle Down Dense | Vehicle Down | Vehicle Vertical
9 | Vehicle Southeast Sparse | Vehicle Southeast | Vehicle SE & NW
10 | Vehicle Southeast Dense | Vehicle Southeast | Vehicle SE & NW
11 | Vehicle Northwest Sparse | Vehicle Northwest | Vehicle SE & NW
12 | Vehicle Northwest Dense | Vehicle Northwest | Vehicle SE & NW
13 | Vehicle Up2Right Turn | Vehicle Up2Right Turn | Vehicle Up2Right Turn
14 | Vehicle Left2Up Turn | Vehicle Left2Up Turn | Vehicle Left2Up Turn
15 | Vehicle Up2Left Turn | Vehicle Up2Left Turn | Vehicle Up2Left Turn
16 | Tram Up | Tram Up | Tram Up
17 | Tram Down | Tram Down | Tram Down
18 | Pedestrian Horizontal | Pedestrian Horizontal | Pedestrian Horizontal
19 | Pedestrian Vertical | Pedestrian Vertical | Pedestrian Vertical


A. Multi-Layer Scene Clustering

Scene Level Clustering: We first group the scenes into
semantically similar clusters by spectral clustering. The
similarity measurement between scenes is the number of
corresponding activities, as defined in Section IV-A. The
self-tuning spectral clustering automatically determines the
appropriate number of clusters which, in the case of our
27-scene dataset, is 11 clusters. Fig. 7 shows the results, in
which semantically similar scenes are indeed grouped (e.g.,
cameras pointing in one direction at road junctions in Cluster 3),
and unique views are separated into their own clusters (e.g.,
Cluster 11).

Learning A Shared Activity Topic Representation: Within
each scene cluster we unify the representation by computing a
shared activity topic basis. We automatically set the number of
shared activities $K^{stb}$ in each scene cluster with $N_s$ scenes as
$K^{stb} = coeff \times N_s$, where $coeff$ is set to 5. The discovered
basis from an example cluster (Scene Cluster 3, shown in
Fig. 7) with 4 scene members is illustrated in Fig. 9. This
figure reveals both activities unique to each scene (Topics 1-15)
and activities common among multiple scenes (Topics 16-20).
Thus some shared activity topics are composed of single
local/original topics, and others of multiple local topics.


B. Cross-Scene Query by Example and Classification


In this section we evaluate the ability of our framework
to support two tasks: cross-scene query by example and
cross-scene behaviour classification. We compare our Scene
Cluster Model (SCM) with a baseline Flat Model (FM).
Our Scene Cluster Model first groups scenes into scene
clusters according to their relatedness and learns an STB for
every scene cluster. Video clips in each scene cluster are
thus represented as topic profiles on the STB of the scene
cluster. As with our model, the Flat Model first learns a local
topic model per scene; however, it then learns a single STB
from all labelled scenes (6 scenes from 2 clusters) without
scene-level clustering, instead of one STB per cluster. The
only difference between SCM and FM is the absence of
scene-level clustering in FM. Note that the Flat Model is a
special case of our Scene Cluster Model with 1 scene-level
cluster. Moreover, modelling the individual scenes separately is also a special
case of our Scene Cluster Model, with one cluster per scene.

Fig. 9: Example STB learned from Scene Cluster 3 (panel title: Composition of Shared Activity Topics in Scene Cluster 3). Shared activity topics may be composed of one or more local/original
topics. Original topics are overlaid on the background frame. Color patches indicate the distribution of motion vectors for a single
activity.

Fig. 10: Query by example MAP with different numbers of
retrievals.

Fig. 11: Examples of cross-scene query by example. The first
column gives 6 query clips randomly chosen from 6 scenes.
The right image matrix illustrates the retrieved clips from the
remaining 5 scenes, sorted by distance to the query from left to
right in the matrix. Color patches overlaid on the background
indicate the visual words accumulated within a video clip.

Query by Example Evaluation: To quantitatively evaluate
query by example, we exhaustively take each scene and each
clip in turn as the query, and all other scenes are considered
as the pool. All clips in the pool are ranked according to
similarity (L2 distance on STB profile) to the query. Performance
is evaluated according to how many clips with the same
behaviour as the query clip are in the top $T$ responses. We
retrieve the best $T = 1 \ldots 200$ clips and calculate the Average
Precision of each category for each $T$. MAP is computed by
taking the mean value of Average Precision over all categories.
The MAP curves over the top $T$ responses to a query for both the
Scene Cluster Model (SCM) and the Flat Model (FM), under
Merge Schemes 1 and 2, are plotted in Fig. 10. It is evident that
for both Merge Schemes 1 and 2, the proposed Scene Cluster
Model (SCM) performs consistently better than the Flat Model
(FM) regardless of the number of top retrievals $T$. This is because
in the Scene Cluster Model, the STB learned from this set
of scenes is highly relevant to each scene in the cluster. In
contrast, the Flat Model learns a single STB for all scenes,
making the STB less relevant to each individual scene and hence
less informative as a representation for retrieval.
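The evaluation protocol can be sketched as below. This is one plausible reading of the description, under the assumption that the per-category score at a given T is the mean precision-at-T over the queries of that category; the exact definition used for the reported curves may differ, and the name `map_at_t` is hypothetical.

```python
from collections import defaultdict


def map_at_t(ranked_labels_per_query, query_labels, t):
    """Mean over behaviour categories of the average precision-at-t.

    ranked_labels_per_query: for each query, the behaviour labels of the pool
                             clips sorted by increasing distance to the query.
    query_labels:            the behaviour label of each query clip.
    """
    per_category = defaultdict(list)
    for ranked, label in zip(ranked_labels_per_query, query_labels):
        top = ranked[:t]
        hits = sum(1 for retrieved in top if retrieved == label)
        per_category[label].append(hits / len(top) if top else 0.0)
    category_means = [sum(v) / len(v) for v in per_category.values()]
    return sum(category_means) / len(category_means)


# Toy example with two behaviour categories and three queries.
ranked = [["A", "B", "A"], ["B", "B", "A"], ["B", "A", "A"]]
labels = ["A", "A", "B"]
score = map_at_t(ranked, labels, t=2)
```

Averaging within each category before averaging across categories keeps frequent behaviours from dominating the score, which matters given the imbalanced behaviour frequencies.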

Qualitative results are also given in Fig. 11 by presenting
6 randomly chosen queries and their retrieved clips. Different
types of behaviours are covered by the query clips, and most
retrieved clips are semantically similar to the query clips. The
only exception is in the 3rd row, where the query clip indicates
traffic going east and turning from left to up. This is because
there is no corresponding behaviour in the other scenes.

Classification Evaluation: In this experiment we quantitatively
evaluate classification performance where the test scene
has no labels. Successful classification thus depends on correctly
finding semantically related scenes and appropriately transferring
labels from them (Section V). We perform leave-one-scene-out
evaluation by holding out one scene as the unlabelled
testing set, and predicting the labels for the test set clips using
the labels in the remaining scenes with a KNN classifier. The
KNN K parameter is determined by cross-validation among
the remaining scenes. Classification performance is evaluated
by the accuracy for each category of behaviour, averaged over
all held-out scenes.
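A minimal sketch of leave-one-scene-out evaluation with a nearest-neighbour classifier is given below. It assumes clips are already represented by STB profiles and reports plain per-scene accuracy; the cross-validated choice of K and the per-category averaging described above are omitted for brevity. The function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training profiles nearest to the query (L2)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_scene_out(clips, k):
    """clips: list of (scene_id, profile, label). Returns {scene_id: accuracy}."""
    accuracy = {}
    for held_out in sorted({scene for scene, _, _ in clips}):
        train = [(p, y) for scene, p, y in clips if scene != held_out]
        test = [(p, y) for scene, p, y in clips if scene == held_out]
        correct = sum(knn_predict(train, p, k) == y for p, y in test)
        accuracy[held_out] = correct / len(test)
    return accuracy


# Toy data: three scenes in one cluster, two behaviours.
clips = [
    (1, [0.9, 0.1], "left"), (1, [0.1, 0.9], "up"),
    (2, [0.8, 0.2], "left"), (2, [0.2, 0.8], "up"),
    (3, [0.7, 0.3], "left"), (3, [0.3, 0.7], "up"),
]
per_scene_accuracy = leave_one_scene_out(clips, k=1)
```

The held-out scene contributes no labels to training, which matches the setting in which a newly added camera has no annotation of its own.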

TABLE II: Cross-scene classification accuracy with 31 and
59 categories for both Scene Cluster Model (SCM) and Flat
Model (FM).

Scene | 31 categories: SCM | 31 categories: FM | 59 categories: SCM | 59 categories: FM
Scene 1 | 55.36% | 50.89% | 42.86% | 40.18%
Scene 2 | 27.68% | 39.29% | 18.75% | 16.96%
Scene 3 | 49.11% | 41.96% | 39.29% | 37.50%
Scene 4 | 54.46% | 46.43% | 37.50% | 36.61%
Scene 5 | 30.36% | 26.79% | 17.86% | 17.86%
Scene 6 | 38.39% | 25.00% | 20.54% | 12.50%
Average | 42.56% | 38.39% | 29.47% | 26.94%


From Table II we observe that at either granularity of
annotation (59 or 31 categories), our Scene Cluster Model
outperforms the Flat Model on average. This shows again that
in order to borrow labels from other scenes for cross-scene
classification, it is important to select relevant sources, which
we achieve via scene clustering. The Flat Model is easily
confused by the wider variety of scenes to borrow labels from,
while our Scene Cluster Model groups similar scenes and
borrows labels only from semantically related scenes to avoid
'negative transfer' [7], [8].

C. Multi-Scene Summarization 

In the final experiment, we evaluate our multi-scene summarization
model against a variety of alternatives. We consider
two conditions: in the first, we consider multi-scene
summarization within a scene cluster (Condition WC); in the
second we consider unconstrained multi-scene summarization
including videos spanning multiple scene clusters (Condition
AC).

Condition: Within-cluster summarization (WC): In this
experiment we focus on the comparison between the Multi-Scene
Model and the Single-Scene Model given various summarization
algorithms. The Multi-Scene Model represents all video clips
from different scenes within a cluster with a single STB
learned from the scene cluster, while the Single-Scene Model
represents each video with scene-specific activities and the
overall summary is the mere concatenation of summaries
from each scene. Specifically, we compare the summarization
methods listed in Table III.

TABLE III: Summarization schemes for Condition WC

Summarization Method | Description
Random | This lower bound picks clips randomly from multiple scenes to compose the summary.
Single-Scene Graph | The overall summary is a concatenation of independent summaries for each video, obtained by recursive Normalized Cut [38] on a graph constructed by taking each video clip as a vertex and the L2 distance between the topic profiles $\gamma$ of clips as edges. Here each video clip is represented by scene-specific local topics. This corresponds to [39] but without the temporal graph.
Single-Scene Kcenter | Similar to the Single-Scene Graph method, but using the Kcenter algorithm in Eq. (11) for summarization instead of Normalized Cut.
Multi-Scene Graph | This model learns an STB to represent video clips from all scenes with STB profiles. Normalized Cut is then applied to cluster clips and find multi-scene summaries.
Multi-Scene Kcenter | Our full model builds an STB from all scenes within a cluster, then uses the Kcenter algorithm to select summary clips from all scenes.

Condition: Across-cluster summarization (AC): In this
experiment, analogous to query and classification, we focus
on the comparison between the Flat Model and the Scene Cluster
Model given different summarization algorithms. The Flat
Model learns a single STB from all available scenes without
discrimination, while the Scene Cluster Model learns an STB per
scene cluster. Specifically, we compare the summarization
schemes in Table IV.


TABLE IV: Summarization schemes for Condition AC

Summarization Method | Description
Random | This picks clips randomly from multiple scenes to compose the summary.
Flat Multi-Scene User Attention | Leverages the magnitude and the spatial and temporal phase of optical flow vectors to index videos. This is the visual attention measurement of ([40], Eq. (6)). We tested the model on a combined video obtained by concatenating the individual videos.
Flat Multi-Scene Graph | This model uses Normalized Cut [38] to cluster all video clips represented as single STB profiles. This is similar to [39].
Flat Multi-Scene Kcenter | Same as Flat Multi-Scene Graph, but using Kcenter to select summary clips.
Scene Cluster Multi-Scene Kcenter | Our full model clusters the scenes, learns an STB on each scene cluster, and then applies Kcenter to build summaries within each scene cluster.


Settings: To systematically evaluate summarization performance,
we vary the length of the requested summary. In
Condition WC the summary varies from 8 to 120 clips (64 seconds
to 16 mins) out of an overall 448 video clips (59.7 mins)
in Scene Cluster 3 (as shown in Fig. 7(a)) and 224 video
clips (29.9 mins) in Scene Cluster 7. In Condition AC the
summary varies from 6 to 120 clips (48 seconds to 16 mins)
out of 672 video clips (89.7 mins) in total, which is the combination
of Scene Clusters 3 and 7. All video clips for summarization
are represented as topic profiles $\gamma$. Recall that each local scene
is learned with $K = 15$ topics and scene clusters with $N_s$
scenes are learned with $K = coeff \times N_s$ topics, where $coeff$
is set to 5 here. For fair comparison, flat model baselines are
learned with the sum of the number of topics for each cluster.

Summarization Evaluation: The performance is evaluated by
the coverage of identified behaviours in the summary, averaged
over 50 independent runs. Fig. 12(a) and (b) show the results
for multi-scene summarization within two example clusters
(Condition WC). Clearly our Multi-Scene Kcenter algorithm
(red) outperforms the baselines: both the Graph Method alternative
(purple) and the single-scene alternatives (dashed lines). The
performance margin between multi-scene and single-scene
models is greater for the first cluster because there are four scenes
here, so there is greater opportunity to exploit inter-scene redundancy.
This validates the effectiveness of jointly exploiting multiple
scenes for summarization. Fig. 12(c) shows the result for
multi-scene summarization across both clusters (Condition
AC): our Scene Cluster Model builds one summary for each
cluster to exploit the expected greater volume of within-cluster
redundancy. In contrast, the Flat Model builds one single
summary, but for a much more diverse group of data, and
the single-scene models have no across-cluster redundancy
to exploit. Even in the flat case, our Kcenter model (in
green) still outperforms all other alternatives (purple and
magenta). It is also worth noting that the user attention model
degenerates severely on our dataset because it cannot
extract semantic meaning from videos in which pure motion
strength is not informative enough to distinguish semantic
behaviours. Qualitative results for multi-scene summarization
are presented in the supplementary material.
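The coverage measure can be sketched as the fraction of distinct annotated behaviours that appear at least once in a summary. This definition is an assumption based on the description above, and the helper names are hypothetical. The second function shows how a random-selection baseline could be averaged over repeated runs.

```python
import random


def behaviour_coverage(summary_labels, all_labels):
    """Fraction of distinct behaviours in the data that the summary contains."""
    distinct = set(all_labels)
    return len(distinct & set(summary_labels)) / len(distinct)


def random_baseline(all_labels, n_sum, runs=50, seed=0):
    """Mean coverage of summaries made of n_sum randomly chosen clips."""
    rng = random.Random(seed)
    scores = [behaviour_coverage(rng.sample(all_labels, n_sum), all_labels)
              for _ in range(runs)]
    return sum(scores) / runs


# Toy annotation: six clips covering four behaviours.
all_labels = ["a", "a", "b", "c", "c", "d"]
coverage = behaviour_coverage(["a", "c"], all_labels)
baseline = random_baseline(all_labels, n_sum=2)
```

A summary method that avoids redundant clips reaches a given coverage with fewer clips than random selection, which is the comparison the coverage curves are meant to show.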




Generalised Scene Alignment: We currently assume that
cameras are installed upright, so only scaling and translational
transforms are applied for scene alignment. However,
more generally, rotational transforms may also be considered.
To that end, one can consider a generalised scene alignment
that includes a rotational parameter $\phi$ in the transformation.
Recall that in Section IV-A we estimate the size of transformed
topics. We can extend that to $N'_a = N_a \times t_s \times \cos(\phi)$ and
$N'_b = N_b \times t_s \times \sin(\phi)$. The generalised transform matrix $T$
is then defined as:

$$T = \begin{bmatrix} t_s \cos(\phi) & -t_s \sin(\phi) & t_x \\ t_s \sin(\phi) & t_s \cos(\phi) & t_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (12)$$

The procedure to transform a topic under this generalised
alignment differs from the original alignment only
in the estimation of direction $d$. To determine $d$ given
$d'$, we represent quantized optical flow as the vector $\mathrm{vec}' =
[\cos(2\pi d'/N_m), \sin(2\pi d'/N_m)]^T$. Then we estimate the
original flow vector $\mathrm{vec} = T^{*-1} \mathrm{vec}'$, where $T^*$ is the $2 \times 2$
matrix from the first two dimensions of $T$, because translation
does not change motion direction. We determine $d$ by nearest
neighbour as follows:

$$d = \arg\min_{d = 1 \ldots N_m} \left\| \mathrm{vec} - \begin{bmatrix} \cos(2\pi d / N_m) \\ \sin(2\pi d / N_m) \end{bmatrix} \right\| \qquad (13)$$

To align scene A to scene B with this generalised alignment,
we can estimate parameters by maximizing the marginal
likelihood of the target document $X_b$ given source topics $\beta_a$.
Specifically, we denote the transform operation with specified
parameters as $\mathcal{H}(\beta | t_s, t_x, t_y, \phi)$. Given target document
$X_b$, the marginal likelihood is $p(X_b | \alpha_a, \mathcal{H}(\beta_a | t_s, t_x, t_y, \phi))$,
where $\alpha_a$ is the Dirichlet prior in scene A. Because the scaling
and translational parameters are computed by a closed-form
solution (Eq. (3)), we only need to search
$\phi = \arg\max_{\phi} p(X_b | \alpha_a, \mathcal{H}(\beta_a | s, dx, dy, \phi))$. However, in our
experiments with this generalised alignment process,
we observed many local optima, suggesting that the rotational
transform is under-constrained and not very repeatable.
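The direction re-quantisation described by Eqs. (12) and (13) can be sketched as follows. The function names are hypothetical, and the sketch assumes directions are indexed from 1 to the number of quantised directions (8 in the experiments), with index d corresponding to the angle 2πd divided by that number.

```python
import math


def transform_matrix(t_s, t_x, t_y, phi):
    """Generalised alignment matrix: scaling, rotation and translation (Eq. 12)."""
    c, s = t_s * math.cos(phi), t_s * math.sin(phi)
    return [[c, -s, t_x],
            [s, c, t_y],
            [0.0, 0.0, 1.0]]


def original_direction(d_prime, t_s, phi, n_dirs=8):
    """Map a quantised direction d' in the transformed frame back to d (Eq. 13)."""
    angle = 2 * math.pi * d_prime / n_dirs
    vx, vy = math.cos(angle), math.sin(angle)
    # The 2x2 block of T is t_s * R(phi); its inverse is R(-phi) / t_s.
    c, s = math.cos(phi) / t_s, math.sin(phi) / t_s
    ox, oy = c * vx + s * vy, -s * vx + c * vy

    def gap(d):
        a = 2 * math.pi * d / n_dirs
        return math.hypot(ox - math.cos(a), oy - math.sin(a))

    # Nearest neighbour over the quantised directions 1..n_dirs.
    return min(range(1, n_dirs + 1), key=gap)


T = transform_matrix(t_s=2.0, t_x=1.0, t_y=3.0, phi=0.0)
unchanged = original_direction(3, t_s=1.0, phi=0.0)
rotated = original_direction(3, t_s=1.2, phi=math.pi / 4)
wrapped = original_direction(1, t_s=1.0, phi=math.pi / 4)
```

With 8 directions, a rotation of π/4 shifts every direction index by exactly one bin, and the translation terms play no role, as noted in the text.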



Fig. 12: Video summarization results: Coverage of behaviours 
versus summary clip length. 


D. Further Analysis 

In this section, we further analyse the robustness of our 
framework, by varying key parameters, and investigate their 
impact on model performance. 


Scene Alignment Stability: We first evaluate the stability of
scene-level alignment. Recall that given two scenes $a$ and $b$, we
first normalize each scene with the geometrical transformations
$T^a_{norm}$ and $T^b_{norm}$. The scene $a$ to $b$ transform is thus defined
by:

$$T^{a2b} = \left(T^{b}_{norm}\right)^{-1} T^{a}_{norm} = \begin{bmatrix} \frac{t^a_s}{t^b_s} & 0 & \frac{t^a_x}{t^b_s} - \frac{t^b_x}{t^b_s} \\ 0 & \frac{t^a_s}{t^b_s} & \frac{t^a_y}{t^b_s} - \frac{t^b_y}{t^b_s} \\ 0 & 0 & 1 \end{bmatrix} \qquad (14)$$

We denote $s^{a2b} = \frac{t^a_s}{t^b_s}$, $dx^{a2b} = \frac{t^a_x}{t^b_s} - \frac{t^b_x}{t^b_s}$ and $dy^{a2b} = \frac{t^a_y}{t^b_s} - \frac{t^b_y}{t^b_s}$.
The parameters estimated from the full data in each scene are
denoted as $s^{a2b}_{ref}$, $dx^{a2b}_{ref}$, $dy^{a2b}_{ref}$. To evaluate the stability of
this alignment, we randomly sample 50% of the original data
from each scene and estimate the parameters again as $s^{a2b}_{i}$,
$dx^{a2b}_{i}$, $dy^{a2b}_{i}$. We run this process $N = 20$ times and calculate
the Root Mean Square Error (RMSE), defined in Eq. (15) for
$s^{a2b}$. RMSE for $dx$ and $dy$ are defined in the same way by
replacing $s^{a2b}$ with $dx^{a2b}$ and $dy^{a2b}$ respectively.

$$RMSE(s) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(s^{a2b}_{i} - s^{a2b}_{ref}\right)^2} \qquad (15)$$

Fig. 13: Alignment and stability across all pairs of 27 scenes. Panels: (a) absolute reference value of scaling; (b) absolute reference value of x translation; (c) absolute reference value of y translation.


We show both the absolute value of the reference parameters
and the RMSE when aligning each pair of scenes in Fig. 13.
It is evident that most scene pairs are scaled between 0.7
and 1.5 (Fig. 13(a)). The worst RMSE(s) among all scene
pairs is 0.0007 (Fig. 13(d)). The same observations can be
made on the variability of x translation and y translation, with
the largest RMSE(dx) and RMSE(dy) being 0.035 pixels or
less while the absolute values of the reference x and y translations
are between 0 and 20 pixels. The small values of these
deviations verify that the scene alignment model is robust
and repeatable. Some examples of scene alignment are shown
in Fig. 14. Whilst the majority of activities are aligned well,
some are less so. This is due to the limitation of a global rigid
transform over a whole scene. A further extension could exploit
individual activity-centred alignment in addition to holistic
scene alignment.
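The stability measure is a root mean square error of repeated estimates around a reference value. A minimal sketch is given below; the sample numbers are invented purely for illustration and are not results from the experiments.

```python
import math


def rmse(estimates, reference):
    """Root mean square error of repeated estimates around a reference value."""
    squared = [(estimate - reference) ** 2 for estimate in estimates]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: a reference scaling from the full data and
# four re-estimates from random 50% subsamples.
reference_scale = 1.0
subsample_scales = [1.01, 0.99, 1.0, 1.0]
scale_rmse = rmse(subsample_scales, reference_scale)
```

The same function applies unchanged to the x and y translation parameters by passing their re-estimates and reference values instead.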

Scene Cluster Stability: We tested the stability of scene-level
clustering by varying cell size, number of local topics,
and clustering strategy: (1) We compared visual word
quantisation with 5 × 5 and 10 × 10 cell sizes. (2) We
evaluated from 5 to 30 local topics in each scene in steps of
5. (3) We performed self-tuning spectral clustering with two
alternative settings: in the first we allowed the model to
automatically determine the number of clusters, and in the second
we fixed the number of clusters to that of
the reference clustering (obtained with 15 local topics and 5 × 5
cell size). We measured the discrepancy between the results
from automatic clustering and the reference clustering using
the Rand Index. It describes the discrepancy between two
set partitions and is frequently used as an evaluation metric
for clustering. The Rand Index is between 0 and 1, with
higher values indicating greater similarity between two partitions.
If two partitions are exactly the same, the Rand Index is 1. We
show the results of the stability test of scene-level clustering
in Fig. 15.
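For reference, the Rand Index counts the fraction of item pairs on which two partitions agree, that is, pairs placed together in both partitions or apart in both. A small self-contained sketch follows; the function name and the toy labels are illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Cluster names do not matter, only which items are grouped together.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
different = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

Because only co-membership is compared, the index can be used even when the two clusterings contain different numbers of clusters, as in the automatic-selection setting.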



Fig. 14: Examples of scene alignment pairs. Each column indicates
one example alignment (from left to right: Scene 4 aligned to Scene 1;
Scene 4 aligned to Scene 2; Scene 6 aligned to Scene 5; Scene 5 aligned to Scene 1).
The first row is the target scene,
the second row is the source scene to be aligned/transformed
and the last row is the source scene after alignment to the
target. Both within scene cluster (first three columns, clusters
3, 3 and 7 respectively) and across cluster (fourth column,
cluster 3 and 7) examples are presented. The overlaid heat
map is the spatial frequency of visual words.



Fig. 15: Stability of scene-level clustering. Panels: (a) Rand Index (RI), cell size = 5; (b) number of clusters, cell size = 5; (c) Rand Index (RI), cell size = 10; (d) number of clusters, cell size = 10. Legend: Rand Index with auto-selected clusters; Rand Index with fixed clusters. The x axis is the number of local topics.


For both cell sizes (5 and 10), automatic cluster selection
generates consistent partitions (high Rand Index), so the
framework is robust to the motion quantisation cell size. However,
it is also evident that automatic selection is
less stable in determining the number of clusters, as indicated
by the red bars in Fig. 15(b) and (d). On the other hand, when
the number of clusters is fixed, the partitioning is more stable
(consistently high Rand Index).

Associating New Scenes: Our model is able to group scenes
according to their semantic relatedness if all the recorded data
are available in advance. In addition, the model is capable of
associating new scenes to existing clusters, e.g. given input
from newly installed cameras at different locations, without
the need to completely re-learn the model. This is achieved
by comparing the local topics of a new scene to the STB
in each scene cluster and choosing the cluster with the highest
relatedness. Only the updated cluster needs to be re-learned to
incorporate the new scene. We tested this approach in Scene
Clusters 3 and 7 by: (1) holding out each scene in turn as the
candidate scene to be associated and learning the STB in each cluster
with the other scenes; (2) computing the relatedness between the
held-out scene and both clusters; (3) associating
the candidate scene to the cluster with the highest relatedness.
We illustrate the result via the distance (defined as $1 - \text{relatedness}$)
between held-out candidate scenes and clusters
in Fig. 16. It is evident that each held-out scene is closer to
its corresponding cluster, so 100% of scenes are associated
correctly. However, this approach is limited to associating
new scenes to existing scene clusters (scenes). A full online
learning multi-scene model is desirable but also challenging
and remains to be developed.

Fig. 16: Association of held-out scenes to clusters. Scenes 1-4
are held out from cluster 3, and scenes 5-6 are held out
from cluster 7. All held-out scenes are correctly associated. The x axis is the scene number.


STB Stability Finally, we investigate the stability of learning 
the Shared Topic Basis (STB) with different number of shared 
topics. Recall that, in section [YII-A| the number of STB topics 
for the Scene Cluster Model (SCM) and the Flat Model (FM) 
is K = coeff x N s . Now let us change coeff from 3 to 
10 and evaluate how this affects the cross-scene classification 
accuracy for both annotation Scheme 1 (59 categories) and 2 
(31 categories). The results are shown in Fig. [17] It is evident 


Fig. 17: Effect of varying the number of topics used. Classification
accuracy of the Scene Cluster Model (SCM) and Flat Model (FM)
under annotation Schemes 1 and 2.


that, for both 59 and 31 categories, our Scene Cluster Model
mostly outperforms the Flat Model over a range of topic numbers.
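
The sweep over topic numbers can be sketched as below. This is an illustrative outline under stated assumptions: `num_scenes` plays the role of N_s (its exact definition is given in Section VII-A and is not restated here), and `evaluate` is a hypothetical callback that trains a model with K shared topics and returns its cross-scene classification accuracy. Neither name comes from the paper.

```python
def stb_topic_counts(num_scenes, coeffs=range(3, 11)):
    """Return K = coeff * N_s for each coeff from 3 to 10 (inclusive)."""
    return {coeff: coeff * num_scenes for coeff in coeffs}


def sweep_topic_numbers(num_scenes, evaluate, coeffs=range(3, 11)):
    """Evaluate a model at every topic count in the sweep.

    `evaluate` is a hypothetical user-supplied function mapping the
    number of shared topics K to a classification accuracy.
    """
    counts = stb_topic_counts(num_scenes, coeffs)
    return {coeff: evaluate(k) for coeff, k in counts.items()}
```

Running `sweep_topic_numbers` once per model (SCM and FM) and once per annotation scheme would give the four curves compared in Fig. 17.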

VIII. Conclusions 

In this paper we introduced a framework for synergistically
modelling multiple-scene datasets captured by multi-camera
surveillance networks. It deals with variable and
piece-wise inter-scene relatedness by semantically clustering
scenes according to the correspondence of semantic activities;
and selectively shares activities across scenes within clusters.
Besides revealing the commonality and uniqueness of each
scene, multi-scene profiling further enables typical surveillance
tasks of query-by-example, behaviour classification and
summarization to be generalised to multiple scenes. Importantly,
by discovering related scenes and shared activities, it is
possible to achieve cross-scene query-by-example (in contrast
to typical within-scene query), and to annotate behaviour in
a novel scene without any labels, which is important for
making deployment of surveillance systems scale in practice.
Finally, we can provide video summarization capabilities that
uniquely exploit redundancy both within and across scenes by
leveraging our multi-scene model.

There are still several limitations to our work which can be 
addressed in the future: (i) In the current framework, scenes 
that can be grouped together are usually morphologically 
similar, which means the underlying motion patterns and view 
angles are essentially similar. More advanced geometrical 
registration techniques could be applied, including similarity 
and affine transformations, to allow scenes with more dramatic 
viewpoint changes to be grouped. (ii) In this work, motion
information is mostly contributed by traffic. However, studying
pedestrian/crowd behaviour is attracting growing interest
[42] due to its wide application in crime prevention and public
security. Compared with traffic, however, pedestrian crowd
behaviours are less regulated and coherent. Thus, extracting
suitable features and improving the model to deal with this
are non-trivial tasks. (iii) Finally, an improved multi-scene
framework that can fully incrementally add new scenes in an 
online manner is of interest. 